DATA SCIENCE SESSIONS VOL. 3

A Foundational Python Data Science Course

Session 13: Simple Linear Regression. Parametric bootstrap.

← Back to course webpage

Feedback should be sent to goran.milovanovic@datakolektiv.com.

These notebooks accompany the DATA SCIENCE SESSIONS VOL. 3 :: A Foundational Python Data Science Course.

Lecturers

Goran S. Milovanović, PhD, DataKolektiv, Chief Scientist & Owner

Aleksandar Cvetković, PhD, DataKolektiv, Consultant

Ilija Lazarević, MA, DataKolektiv, Consultant


1. Simple Linear Regression

Target: predict Weight from Height

The linear model has the form

$$y = \beta_1 x + \beta_0 + \varepsilon,$$

where $y$ is the target (Weight), $x$ the predictor (Height), $\beta_1$ the slope, $\beta_0$ the intercept, and $\varepsilon$ a random error term.

Linear regression computes the predicted value $\hat{y}$ of the target variable as

$$\hat{y} = \beta_1 x + \beta_0.$$

OK, statsmodels can do it; but how do we find the optimal values of $\beta_0$ and $\beta_1$ ourselves? Let's build a function that takes some particular values of $\beta_0$ and $\beta_1$ for a particular regression problem (i.e. for a particular dataset) and returns the model error.

The model error? Oh. Remember the residuals:

$$\varepsilon_i = y_i - \hat{y}_i$$

where $y_i$ is the observed value to be predicted, and $\hat{y}_i$ the model's prediction?

Next we do something similar to what happens in the computation of the variance: we square the differences,

$$\varepsilon_i^2 = (y_i - \hat{y}_i)^2$$

and define the model error for all observations to be the sum of squares:

$$SSE = \sum_{i=1}^{N}(y_i - \hat{y}_i)^2$$

Obviously, the lower the $SSE$ - the Sum of Squared Errors - the better the model! Here's a function that returns the SSE for a given dataset (with two columns: the predictor and the criterion) and a choice of parameters $\beta_0$ and $\beta_1$:
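A minimal sketch of such a function (the name `lg_sse` matches the call below; the two-column layout, predictor first and criterion second, is an assumption):

```python
import numpy as np

def lg_sse(data, beta_0, beta_1):
    """SSE of the line y_hat = beta_1 * x + beta_0 over a two-column
    dataset: column 0 is the predictor, column 1 the criterion."""
    x, y = data[:, 0], data[:, 1]
    return np.sum((y - (beta_1 * x + beta_0)) ** 2)

# Tiny check: three points lying exactly on y = 2x + 1,
# so the SSE at (beta_0, beta_1) = (1, 2) should be zero
demo = np.array([[1.0, 3.0], [2.0, 5.0], [3.0, 7.0]])
```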

Test lg_sse() now:

Check via statsmodels:

Method A. Random parameter space search
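One way to sketch the random search, assuming synthetic Height/Weight-like data (the course notebook presumably uses a real dataset) and illustrative sampling ranges for the parameters:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the Height/Weight data (assumption)
x = rng.uniform(150, 200, size=200)             # "Height"
y = 0.9 * x - 100 + rng.normal(0, 5, size=200)  # "Weight"
data = np.column_stack([x, y])

def lg_sse(data, beta_0, beta_1):
    res = data[:, 1] - (beta_1 * data[:, 0] + beta_0)
    return np.sum(res ** 2)

# Draw random (beta_0, beta_1) pairs; keep the pair with the lowest SSE
n_draws = 10_000
b0s = rng.uniform(-200, 0, size=n_draws)   # illustrative range
b1s = rng.uniform(0, 2, size=n_draws)      # illustrative range
sses = np.array([lg_sse(data, b0, b1) for b0, b1 in zip(b0s, b1s)])
best = np.argmin(sses)
best_b0, best_b1 = b0s[best], b1s[best]
```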

Check with statsmodels:

Not bad. How about 100,000 random pairs?

Method B. Grid search
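A sketch of the grid search over the same synthetic data (grid bounds and resolution are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(150, 200, size=200)             # synthetic "Height"
y = 0.9 * x - 100 + rng.normal(0, 5, size=200)  # synthetic "Weight"
data = np.column_stack([x, y])

def lg_sse(data, beta_0, beta_1):
    res = data[:, 1] - (beta_1 * data[:, 0] + beta_0)
    return np.sum(res ** 2)

# Evaluate the SSE at every node of a regular (beta_0, beta_1) grid
best_sse, best_b0, best_b1 = np.inf, None, None
for b0 in np.linspace(-200, 0, 101):
    for b1 in np.linspace(0, 2, 101):
        sse = lg_sse(data, b0, b1)
        if sse < best_sse:
            best_sse, best_b0, best_b1 = sse, b0, b1
```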

A denser grid:

Check with statsmodels:

Method C. Optimization (the real thing)

The Method of Least Squares
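A sketch of the idea, again on synthetic data: minimize the SSE numerically with `scipy.optimize.minimize` (the derivative-free Powell method is an illustrative choice, not necessarily the one used in the course notebook), and compare against the closed-form least-squares solution:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)
x = rng.uniform(150, 200, size=200)             # synthetic "Height"
y = 0.9 * x - 100 + rng.normal(0, 5, size=200)  # synthetic "Weight"

def objective(beta):
    """SSE as a function of the parameter vector (beta_0, beta_1)."""
    return np.sum((y - (beta[1] * x + beta[0])) ** 2)

# Numerical minimization of the SSE
result = minimize(objective, x0=[0.0, 0.0], method="Powell")
b0_opt, b1_opt = result.x

# Closed-form least-squares solution for comparison
b1_ls = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_ls = y.mean() - b1_ls * x.mean()
```

`result.fun` then holds the final value of the objective function, i.e. the model SSE.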

Check against statsmodels

Final value of the objective function (which is indeed the model SSE):

Check against statsmodels

Error Surface Plot: The Objective Function

This is the function that we have minimized:
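The surface can be sketched by evaluating the SSE on a mesh of parameter values (synthetic data and grid bounds are, again, illustrative). With matplotlib, the resulting `SSE` array is what one would hand to `Axes3D.plot_surface` to draw the error surface:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(150, 200, size=200)             # synthetic "Height"
y = 0.9 * x - 100 + rng.normal(0, 5, size=200)  # synthetic "Weight"

# SSE evaluated at every node of a (beta_0, beta_1) mesh
B0, B1 = np.meshgrid(np.linspace(-200, 0, 80), np.linspace(0, 2, 80))
resid = y[None, None, :] - (B1[:, :, None] * x[None, None, :] + B0[:, :, None])
SSE = np.sum(resid ** 2, axis=-1)

# The grid node with the lowest SSE approximates the least-squares solution
i, j = np.unravel_index(np.argmin(SSE), SSE.shape)
```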

Back to statsmodels

Linear Regression using scikit-learn
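The same fit in scikit-learn might be sketched as follows (synthetic data again; note that scikit-learn expects a 2D feature matrix, hence the `reshape`):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
x = rng.uniform(150, 200, size=200)             # synthetic "Height"
y = 0.9 * x - 100 + rng.normal(0, 5, size=200)  # synthetic "Weight"

X = x.reshape(-1, 1)           # one feature column
reg = LinearRegression().fit(X, y)

b1_hat = reg.coef_[0]          # slope
b0_hat = reg.intercept_        # intercept
r2 = reg.score(X, y)           # coefficient of determination
```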

2. Parametric Bootstrap

Bias and Variance via the Bootstrap

First: the model parameters and their standard errors

Second: the standard deviation of model residuals

Third: the Sim-Fit Loop, Parametric Bootstrap
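The sim-fit loop might be sketched as follows (synthetic data, closed-form OLS refits, and an illustrative number of bootstrap replicates): fit the model once, estimate the residual standard deviation, then repeatedly simulate new responses from the fitted model and refit.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(150, 200, size=200)             # synthetic "Height"
y = 0.9 * x - 100 + rng.normal(0, 5, size=200)  # synthetic "Weight"

def ols_fit(x, y):
    """Closed-form least-squares estimates (beta_0, beta_1)."""
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    return y.mean() - b1 * x.mean(), b1

# First: fit the model to the observed data
b0, b1 = ols_fit(x, y)

# Second: residual standard deviation (n - 2 degrees of freedom)
resid = y - (b0 + b1 * x)
sigma = np.sqrt(np.sum(resid ** 2) / (len(x) - 2))

# Third: the Sim-Fit loop: simulate from the fitted model, then refit
n_boot = 1000
boot = np.empty((n_boot, 2))
for i in range(n_boot):
    y_sim = b0 + b1 * x + rng.normal(0, sigma, size=len(x))
    boot[i] = ols_fit(x, y_sim)

# Bootstrap standard errors of the two parameters
se_b0, se_b1 = boot.std(axis=0, ddof=1)
```

The spread of the bootstrapped estimates approximates the parameters' standard errors, which can then be compared with those reported by the original regression fit.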

Compare with model parameter variances as estimated from the original Linear Regression

Further Reading


DataKolektiv, 2022/23.

hello@datakolektiv.com

License: [GPLv3](https://www.gnu.org/licenses/gpl-3.0.txt) This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.